As discussed earlier, zero-dimensional points are essential for geographic features such as wells or soil-sampling locations that are best represented by a single reference point. Point vector data often group several points into a multi-point structure with a single attribute record; for example, all the soil-sampling sites could be treated as a single geometry. The main drawback of point features, however, is that they cannot be used to make area or length measurements (as you can with a polygon).
In this section, we will use 650 soil samples from Colorado, Kansas, New Mexico, and Wyoming. These samples were collected by the United States Geological Survey (USGS) as part of the USGS Geochemical Landscapes Project (Smith et al., 2011). We will learn how to join spatial and non-spatial data and create a data frame for spatial interpolation.
This exercise consists of the following major steps:
Part-1: Create a Spatial Point Data Frame
We will use the following spatial data to create the data frame. The data can be downloaded from here.
GP_ID.csv: Contains FIPS codes, state, and county names
Soil sampling locations (GP_GPS.csv): This file contains coordinates of 473 soil sampling sites in Colorado, Kansas, New Mexico, and Wyoming. The soil samples were collected by the United States Geological Survey (USGS) from the A horizon at a sampling density of about 1 sample per 1,600 km² (Smith et al., 2011).
Soil organic carbon (SOC) (GP_SOC.csv): SOC concentrations at these 473 sites. The SOC concentrations were predicted by mid-infrared (MIR) spectroscopy and partial least squares regression (PLSR) analysis as described previously (Janik et al., 2007; Ahmed et al., 2017).
Environmental covariables (gp_point_data.csv)
Prediction grid data
import seaborn as sns
sns.set_style("white")
sns.set(font_scale=1.5)
## Set working directory
import os
path= "E:\GitHub\geospatial-python-github.io\geospatial-python.io\Lesson_07_working_with_point_data"
os.chdir(path)
print("Current Working Directory " , os.getcwd())
import pandas as pd
# Read ID file
f_id = 'Data_07/GP_ID.csv'
id_df = pd.read_csv(f_id)
id_df.head()
import pandas as pd
# Read GPS of soil sampling locations
f_gps = 'Data_07/GP_GPS.csv'
gps = pd.read_csv(f_gps)
gps.head()
import pandas as pd
# Soil OC data
f_soc = 'Data_07/GP_SOC.csv'
soc = pd.read_csv(f_soc)
soc.head()
import pandas as pd
f_co = 'Data_07/gp_point_data.csv'
co_var = pd.read_csv(f_co)
# Convert NLCD to integer
co_var['NLCD']=co_var['NLCD'].astype(int)
co_var.head()
import pandas as pd
f_nlcd = 'Data_07/NLCD_ID.csv'
nlcd_id = pd.read_csv(f_nlcd)
nlcd_id.head()
A GeoDataFrame needs a shapely geometry object for each record. We use the geopandas points_from_xy() function to transform longitude and latitude into a list of shapely Point objects and set it as the geometry while creating the GeoDataFrame. (Note that points_from_xy() is an enhanced wrapper for [Point(x, y) for x, y in zip(df.Longitude, df.Latitude)].)
import geopandas as gpd
gps_point = gpd.GeoDataFrame(
gps, geometry=gpd.points_from_xy(gps.long, gps.lat))
gps_point.head()
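As a quick check, the same geometry column can be built with an explicit shapely list comprehension (a minimal sketch using the long and lat columns of GP_GPS.csv):
from shapely.geometry import Point
import geopandas as gpd
# Equivalent construction with an explicit list of shapely Point objects
manual_points = [Point(x, y) for x, y in zip(gps.long, gps.lat)]
gps_point_manual = gpd.GeoDataFrame(gps.copy(), geometry=manual_points)
# Both approaches should yield identical geometries
gps_point_manual.geometry.geom_equals(gps_point.geometry).all()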
# Plot data
gps_point.plot()
We will transform the point shapefile from WGS 1984 to Albers Equal Area Conic NAD83 (EPSG:5070). We can reproject the data from one CRS to another with the geopandas method (i.e., a function belonging to geopandas objects) to_crs(). The .to_crs() method will only work if your original spatial object has a CRS assigned to it and if that CRS is the correct CRS for the data.
print(gps_point.crs)
We have to define the CRS before reprojecting to a projected coordinate system. We assign the WGS84 latitude-longitude CRS (EPSG:4326) to the point GeoDataFrame:
# Define CRS (WGS84, EPSG:4326)
gps_point = gps_point.set_crs(epsg=4326)
Now we will reproject the gps_point data using the CRS of the GP_STATE shapefile. First we extract the CRS from the state shapefile.
import geopandas as gpd
# US State shape files
fn="Data_07/GP_STATE.shp"
GP_STATE= gpd.read_file(fn)
# CRS of state polygon
GP_crs=GP_STATE.crs
GP_crs
# Reproject to the state polygon CRS (EPSG:5070)
point_proj=gps_point.to_crs(GP_crs)
import matplotlib.pyplot as plt
# Make subplots that are next to each other
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 8))
# Plot the data in WGS84 CRS
gps_point.plot(ax=ax1, facecolor='gray');
# Add title
ax1.set_title("WGS84");
# Plot the reprojected data
point_proj.plot(ax=ax2, facecolor='blue');
# Add title
ax2.set_title("Albers Equal Area Conic");
# Remove empty white space around the plot
plt.tight_layout()
Attribute joins are accomplished using the merge() function. In general, it is recommended to use the merge method called from the spatial dataset. The stand-alone merge() function will work if the GeoDataFrame is in the left argument; if a DataFrame is in the left argument and a GeoDataFrame is in the right position, the result will no longer be a GeoDataFrame.
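A toy illustration of this caveat (hypothetical two-row frames, just to show the returned types):
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
# A small GeoDataFrame and a plain DataFrame sharing a key column
toy_gdf = gpd.GeoDataFrame({'SiteID': [1, 2]}, geometry=[Point(0, 0), Point(1, 1)])
toy_df = pd.DataFrame({'SiteID': [1, 2], 'SOC': [5.0, 7.5]})
print(type(toy_gdf.merge(toy_df, on='SiteID')))  # GeoDataFrame: spatial class preserved
print(type(toy_df.merge(toy_gdf, on='SiteID')))  # plain DataFrame: spatial class lost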
First we will create a pandas DataFrame with all the non-spatial data, using the pd.merge() function.
from functools import reduce
df1 = [id_df, soc,co_var]
df2 = reduce(lambda left,right: pd.merge(left,right,on='SiteID'), df1)
df2
Now we add the NLCD description column to the data frame.
from functools import reduce
d3 = [df2, nlcd_id]
df = reduce(lambda left,right: pd.merge(left,right,on='NLCD'), d3)
df
Now we will merge the GeoDataFrame (point_proj) with the pandas DataFrame (df) on the SiteID of each sampling location.
point_spdf = point_proj.merge(df, on='SiteID')
point_spdf.head()
point_spdf.to_file("Data_07/gp_point_all_data.shp")
Now we will extract the projected coordinates (in meters) from the geometry column of the spatial point data frame.
# Get X (m) and Y (m) Values from geometry column
point_spdf['x'] = point_spdf.geometry.x
point_spdf['y'] = point_spdf.geometry.y
point_spdf.head()
Now we will convert the GeoDataFrame to a pandas DataFrame without the geometry column. We will use this data frame for further analysis.
import pandas as pd
point_df = pd.DataFrame(point_spdf.drop(columns=['geometry']))
# Organize variables
point_mf= point_df[['SiteID', 'long', 'lat', 'x', 'y', 'FIPS','STATE', 'COUNTY',
'SOC', 'DEM','Slope','MAP', 'MAT','NDVI','NLCD','NLCD_DES']]
point_mf.head()
point_mf.to_csv('Data_07/gp_all_data.csv', header =True, index=False)
import seaborn as sns
p, ax = plt.subplots(figsize=(10, 6))
ax=sns.boxplot(y="SOC",
x="NLCD_DES",
data=point_mf,
palette="Blues_d")
ax.axes.xaxis.label.set_text("NLCD")
ax.axes.yaxis.label.set_text("Soil OC (mg/g)")
#plt.title('Mean ± SE of Soil OC')
import plotly.express as px
fig = px.box(point_mf,
x="NLCD_DES",
y="SOC",
width=800,
height=500)
fig.update_layout(
title={
'text': "Sourface Soil Organic (mg/g)",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
xaxis_title="NLCD",
yaxis_title="Soil OC",
font=dict(
family="sans-serif, monospace",
size=18,
color="#7f7f7f"))
fig.show()
# Sort data set by SOC in descending order
mf_soc = point_mf.sort_values("SOC", ascending=False)
import seaborn as sns
p, ax = plt.subplots(figsize=(10, 6))
ax=sns.barplot(y="SOC",
x="NLCD_DES",
data=point_mf,
capsize=.2,
#width=.3,
palette="Blues_d")
ax.axes.xaxis.label.set_text("NLCD")
ax.axes.yaxis.label.set_text("Soil OC (mg/g)")
#plt.title('Mean ± SE of Soil OC')
import seaborn as sns
p, ax = plt.subplots(figsize=(10, 7))
ax=sns.barplot(x="SOC",
y="NLCD_DES",
data=point_mf,
capsize=.2,
#width=.3,
palette="Blues_d")
ax.axes.yaxis.label.set_text("")
ax.axes.xaxis.label.set_text("Soil OC (mg/g)")
point_mf.plot(x='NDVI', y='SOC', style ='o')
import seaborn as sns
sns.jointplot(x=point_mf["NDVI"], y=point_mf["SOC"], kind='scatter')
sns.jointplot(x=point_mf["NDVI"], y=point_mf["SOC"], kind='hex')
sns.jointplot(x=point_mf["NDVI"], y=point_mf["SOC"], kind='kde')
import plotly.express as px
fig = px.scatter_3d(point_mf, x='DEM', y='NDVI', z='SOC',
color='SOC',
size='SOC', size_max=18,
opacity=0.7)
# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
We will create a quantile map to see the spatial distribution of SOC across the four states (the scheme='quantiles' option in geopandas requires the mapclassify package).
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap, BoundaryNorm
from matplotlib import cm
from mpl_toolkits.axes_grid1 import make_axes_locatable
## Plot Soil C
# set the range for the choropleth
vmin, vmax = 0, 32
# create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(6, 7))
# State boundary
GP_STATE.plot(ax=ax, color="white", edgecolor="gray")
# Point data
point_spdf.plot(column='SOC',
cmap='Spectral',
ax=ax,
alpha=1,
scheme='quantiles',
k =7,
markersize=30,
)
# remove the axis
#ax.axis('off')
# add a title
ax.set_title('Soil Organic C (mg/g)',
fontdict={'fontsize': '20', 'fontweight' : '3'})
# create an axes on the right side of ax. The width of cax will be 5%
# of ax and the padding between cax and ax will be fixed at 0.05 inch.
# You need to import mpl_toolkits.axes_grid1
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
# Create colorbar as a legend
sm = plt.cm.ScalarMappable(cmap='Spectral',
norm=plt.Normalize(vmin=vmin, vmax=vmax))
# empty array for the data range
sm._A = []
# add the colorbar to the figure
cbar = fig.colorbar(sm, cax=cax)
import plotly.graph_objects as go
point_spdf['text'] = point_spdf['STATE'] + ', ' + point_spdf['COUNTY'] + ', SOC: ' + point_spdf['SOC'].astype(str)
fig = go.Figure(data=go.Scattergeo(
locationmode = 'USA-states',
lon = point_spdf['long'],
lat = point_spdf['lat'],
text = point_spdf['text'],
mode = 'markers',
marker = dict(
size = 6,
opacity = 1,
reversescale = False,
autocolorscale = False,
symbol = 'circle-open',
line = dict(
width=0.8,
color='rgb(102, 102, 102)'
),
colorscale = "YlGnBu",
cmin = 0,
color = point_spdf['SOC'],
cmax = point_spdf['SOC'].max(),
colorbar_title="Soil OC <br> (mg/g) "
)))
fig.update_layout(
title={
'text': "Spatial Distribution of Soil Organic C (mg/g)",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
geo = dict(
scope='usa',
projection_type='albers usa',
showland = True,
landcolor = "rgb(250, 250, 250)",
subunitcolor = "rgb(217, 217, 217)",
countrycolor = "rgb(217, 217, 217)",
countrywidth = 0.5,
subunitwidth = 0.5))
fig.show()
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task (Wikipedia, 2020).
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
from scipy.stats import norm
fig = plt.figure(figsize=(7, 6))
sns.distplot(point_mf['SOC'], fit=norm);
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(point_mf['SOC'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('SOC distribution')
#Get also the QQ-plot
fig = plt.figure(figsize=(7, 6))
res =st.probplot(point_mf['SOC'], plot=plt)
plt.show()
You can create a nice-looking Q-Q plot with Pingouin, an open-source statistical package written in Python 3 and based mostly on pandas and NumPy.
import numpy as np
import pingouin as pg
ax = pg.qqplot(point_mf['SOC'], dist='norm')
The pingouin.normality() function works with lists, arrays, or pandas DataFrame in wide or long-format.
import numpy as np
import pingouin as pg
print(pg.normality(point_mf['SOC']))
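As noted above, Pingouin also accepts a wide-format DataFrame; a sketch testing several columns at once (assuming the installed version supports wide-format input):
# Shapiro-Wilk normality test for each column (wide-format input)
print(pg.normality(point_mf[['SOC', 'NDVI', 'MAP']]))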
from scipy.stats import skew
skew(point_mf['SOC'])
# Box-Cox transform with lambda = 0, which is equivalent to a log transform
point_mf['SOC_BC'] = st.boxcox(point_mf["SOC"], lmbda=0)
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(7, 6))
sns.distplot(point_mf['SOC_BC'], fit=norm);
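If you would rather let SciPy estimate the optimal lambda instead of fixing it at zero, a minimal sketch (assuming all SOC values are positive):
import scipy.stats as st
# boxcox() without lmbda returns the transformed data and the fitted lambda
soc_bc_opt, fitted_lambda = st.boxcox(point_mf['SOC'])
print('Fitted Box-Cox lambda: {:.3f}'.format(fitted_lambda))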
point_mf[['SOC', 'DEM', 'Slope','MAP', 'MAT','NDVI']].describe()
# Descriptive statistics
soc_stat=point_mf.groupby("NLCD_DES")['SOC'].describe().reset_index()
soc_stat['count_squareroot']=soc_stat['count']**(1/2)
# Standard error of the mean = SD / sqrt(n)
soc_stat['sem'] = soc_stat['std'] / soc_stat['count_squareroot']
# Sort by mean SOC in ascending order
mf_soc_stat=soc_stat.sort_values("mean",ascending=True)
mf_soc_stat
import numpy as np
# Create lists for the plot
NLCD = mf_soc_stat['NLCD_DES']
x_pos = np.arange(len(NLCD))
SOC = mf_soc_stat['mean']
error =mf_soc_stat ['sem']
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(NLCD, SOC, yerr=error, align='center', alpha=0.5, ecolor='black', capsize=10)
ax.set_ylabel('Soil SOC (mg/g)')
ax.set_xticks(x_pos)
ax.set_xticklabels(NLCD)
ax.set_title('Mean ± SE of Soil OC')
ax.yaxis.grid(True)
# Save the figure and show
plt.tight_layout()
# plt.savefig('bar_plot_with_error_bars.png')
plt.show()
Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the "variation" among and between groups) used to analyze the differences among group means in a sample (Source: https://en.wikipedia.org/wiki/Analysis_of_variance).
import numpy as np
import pingouin as pg
aov = pg.anova(data=df, dv='SOC', between='NLCD_DES', detailed=True)
aov
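If the ANOVA is significant, a post-hoc test shows which NLCD classes differ from each other; a sketch using Pingouin's pairwise Tukey-HSD test:
# Pairwise Tukey-HSD comparisons of mean SOC between NLCD classes
posthoc = pg.pairwise_tukey(data=df, dv='SOC', between='NLCD_DES')
posthoc.round(3)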
mf=point_mf[['SOC', 'DEM', 'Slope','MAP', 'MAT','NDVI']]
mf.corr()
You can use the Pingouin package for correlation analysis.
import numpy as np
import pingouin as pg
pg.pairwise_corr(mf, columns=['SOC', 'DEM', 'Slope','MAP', 'MAT','NDVI'], method='pearson')
import matplotlib.pyplot as plt
corrmat = mf.corr()
f, ax = plt.subplots(figsize =(9, 8))
sns.heatmap(corrmat, ax = ax, cmap ="YlGnBu", linewidths = 0.1)
cg = sns.clustermap(corrmat, cmap ="YlGnBu", linewidths = 0.1);
plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation = 0)
cg
pd.plotting.scatter_matrix(point_mf[['SOC', 'DEM', 'MAT','NDVI']], alpha=0.2, figsize=(10, 10))
plt.show()
Seaborn makes it easy to create a correlogram (correlation matrix plot). Correlograms are great for exploratory analysis: they let you quickly observe the relationship between every pair of variables in your matrix. Just call the pairplot function.
sns.pairplot(point_mf[['SOC', 'DEM', 'MAT','NDVI']], diag_kind="kde")
"Regression is a fundamental problem in statistics and machine learning. In regression studies, we are typically interested in inferring a real-valued function (called a regression function) whose values correspond to the mean of a dependent (or response or output) variable conditioned on one or more independent (or input) variables. Many different techniques for estimating this regression function have been developed, including parametric, semi-parametric, and nonparametric methods." (Source: Quadrianto N., Buntine W.L. (2011) Regression. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_710)
Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model (Wikipedia).
We will use "scikit-learn", a machine learning library for the Python programming language.It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy (Wikipedia).
Simple linear regression summarizes and studies the relationship between two continuous (quantitative) variables. One variable, denoted x, is regarded as the predictor, explanatory, or independent variable. The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
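In symbols, the model fitted below (with NDVI as x and SOC as y) is $y = \beta_0 + \beta_1 x + \varepsilon$, where $\beta_0$ is the intercept, $\beta_1$ the slope, and $\varepsilon$ the error term.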
Separate the target value, or "y", from the features. This label is the value that you will train the model to predict.
x = mf['NDVI'].values.reshape(-1,1)
y = mf['SOC'].values.reshape(-1,1)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.25, random_state=42)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
It is good practice to normalize or standardize features that use different scales and ranges. Although the model might converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training data only, then apply it to both the training and test sets
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr=slr.fit(x_train, y_train) #training the algorithm
print('Intercept:', slr.intercept_)
print('Slope:', slr.coef_)
y_train_pred=slr.predict(x_train)
train_pred= pd.DataFrame({'Actual': y_train.flatten(), 'Predicted': y_train_pred.flatten()})
train_pred['residuals']=train_pred['Actual'] - train_pred['Predicted']
train_pred
Now that we have trained our algorithm, it's time to make some predictions. We will use the test data and see how accurately SLR predicts SOC.
y_slr_pred = slr.predict(x_test)
slr_pred = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': y_slr_pred.flatten()})
slr_pred.describe()
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=slr_pred)
ax.set_ylim(0,40)
ax.set_xlim(0,40)
ax.plot([0, 40], [0, 40], 'k--', lw=2)
plt.title('SLR Predicted SOC (mg/g)', fontsize=18)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_slr_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_slr_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_slr_pred)))
Multiple linear regression (MLR) explores the relationship between several explanatory variables and a response variable. The goal of MLR is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.
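With the predictors DEM, Slope, MAT, MAP, and NDVI used below, the model takes the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$, where each $\beta_j$ is the partial regression coefficient of covariate $x_j$.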
Xm =mf[['DEM', 'Slope', 'MAT', 'MAP','NDVI']].values
y = mf[['SOC']].values
x_train, x_test, y_train, y_test = train_test_split(Xm, y, test_size=0.2, random_state=0)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(x_train, y_train)
pred_mlr = pd.DataFrame(['DEM', 'Slope', 'MAT', 'MAP','NDVI'])
coeff = pd.DataFrame(mlr.coef_, index=['Co-efficient']).transpose()
pd.concat([pred_mlr,coeff], axis=1, join='inner')
import sklearn.metrics as metrics
y_train_mlr=mlr.predict(x_train)
explained_variance=metrics.explained_variance_score(y_train, y_train_mlr)
mean_absolute_error=metrics.mean_absolute_error(y_train, y_train_mlr)
mse = metrics.mean_squared_error(y_train, y_train_mlr)
median_absolute_error = metrics.median_absolute_error(y_train, y_train_mlr)
r2 = metrics.r2_score(y_train, y_train_mlr)
print('explained_variance: ', round(explained_variance,4))
print('r2: ', round(r2,4))
print('MAE: ', round(mean_absolute_error,4))
print('MSE: ', round(mse,4))
print('RMSE: ', round(np.sqrt(mse),4))
import numpy as np
import pingouin as pg
pg.linear_regression(mf[['DEM', 'Slope', 'MAT', 'MAP','NDVI']], mf['SOC'])
"Cross-validation is a process for creating a distribution of pairs of training and test sets out of a single data set. In cross validation the data are partitioned into k subsets, S1…Sk, each called a fold. The folds are usually of approximately the same size. The learning algorithm is then applied k times, for i = 1 to k, each time using the union of all subsets other than Si as the training set and using Si as the test set". (Source: (2011) Cross-Validation. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_190)
In k-fold cross-validation the data are split into k subsets (folds). For each fold, we train the model on the other k−1 folds and evaluate it on the held-out fold, so that every fold serves as the test set exactly once; the scores are then averaged across folds. We will use the entire data set (n = 666) in this tutorial.
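A tiny illustration of how KFold partitions the data (a toy sketch with six samples and three folds, unrelated to the soil data):
import numpy as np
from sklearn.model_selection import KFold
toy = np.arange(6)
# Each sample appears in the test fold exactly once across the three splits
for train_idx, test_idx in KFold(n_splits=3, shuffle=False).split(toy):
    print('train:', train_idx, 'test:', test_idx)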
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_validate
import sklearn.metrics as metrics
# 10-fold cross-validation
cv_Kfold = KFold(n_splits=10, random_state=1, shuffle=True)
mlr = LinearRegression()
mlr.fit(Xm, y)
# Negative MSE for each of the 10 folds
cv_scores_Kfold = cross_val_score(mlr, Xm, y, scoring="neg_mean_squared_error", cv=cv_Kfold, n_jobs=1)
# R2 and MAE for each fold
all_scores = pd.DataFrame(cross_validate(mlr, Xm, y, scoring=['r2', 'neg_mean_absolute_error'],
                                         cv=cv_Kfold, n_jobs=-1, verbose=0))
print(all_scores)
# Mean and SD of the fold MSEs
print("Folds: " + str(len(cv_scores_Kfold)) + ", MSE: " + str(np.mean(np.abs(cv_scores_Kfold))) + ", STD: " + str(np.std(cv_scores_Kfold)))
# Cross-validation prediction
y_pred_KFold = cross_val_predict(mlr, Xm, y, cv=10)
Kfold_pred = pd.DataFrame({'Observed': y.flatten(), 'Predicted': y_pred_KFold.flatten()})
Kfold_pred.describe()
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=Kfold_pred)
ax.set_ylim(0,35)
ax.set_xlim(0,35)
ax.plot([0, 35], [0, 35], 'k--', lw=2)
plt.title('K-Fold Cross-Validation',fontsize=18)
import sklearn.metrics as metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_pred_KFold))
print('Mean Squared Error:', metrics.mean_squared_error(y, y_pred_KFold))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, y_pred_KFold)))
print('R2:', metrics.r2_score(y, y_pred_KFold))
Leave-one-out cross-validation is a special case of cross-validation where the number of folds equals the number of instances in the data set. Thus, the learning algorithm is applied once for each instance, using all other instances as a training set and using the selected instance as a single-item test set. This process is closely related to the statistical method of jack-knife estimation (Efron, 1982). (Source:(2011) Leave-One-Out Cross-Validation. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_469).
from sklearn.model_selection import LeaveOneOut
# Leave-one-out cross-validation: one fold per observation
cv_LOV = LeaveOneOut()
cv_scores_LOV = cross_val_score(mlr, Xm, y, scoring="neg_mean_squared_error", cv=cv_LOV, n_jobs=1)
# Mean and SD
print("Folds: " + str(len(cv_scores_LOV)) + ", MSE: " + str(np.mean(np.abs(cv_scores_LOV))) + ", STD: " + str(np.std(cv_scores_LOV)))
# Leave-one-out prediction for every observation
y_pred_LOV = cross_val_predict(mlr, Xm, y, cv=cv_LOV)
LOV_pred = pd.DataFrame({'Observed': y.flatten(), 'Predicted': y_pred_LOV.flatten()})
LOV_pred.describe()
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=LOV_pred)
ax.set_ylim(0,35)
ax.set_xlim(0,35)
ax.plot([0, 35], [0, 35], 'k--', lw=2)
plt.title('LOV Cross-Validation',fontsize=18)
Now that we have trained our algorithm, it's time to make some predictions. We will use the test data and see how accurately MLR predicts SOC.
y_mlr_pred = mlr.predict(x_test)
mlr_pred = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': y_mlr_pred.flatten()})
mlr_pred.describe()
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=mlr_pred)
ax.set_ylim(0,30)
ax.set_xlim(0,30)
ax.plot([0, 30], [0, 30], 'k--', lw=2)
plt.title('MLR Predicted SOC (mg/g)', fontsize=18)
import sklearn.metrics as metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_mlr_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_mlr_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_mlr_pred)))
print('R2:', metrics.r2_score(y_test, y_mlr_pred))
Now we will predict SOC at the grid locations of the four US states.
import pandas as pd
g_id = 'Data_07/gp_grid_data.csv'
grid_df = pd.read_csv(g_id)
grid_df.head()
x_grid = grid_df[['DEM','Slope','MAP','MAT','NDVI']]
# The MLR model was fitted on unscaled covariates, so the grid covariates are used as-is
mlr_grid = mlr.predict(x_grid)
mlr_grid
grid_df['Pred_SOC'] = pd.DataFrame(mlr_grid)
grid_df
import geopandas as gpd
# create data-frame
grid_df_rf = grid_df[['GridID','x','y','Pred_SOC']]
# Convert to spatial point data-frame
grid_rf = gpd.GeoDataFrame(
grid_df_rf, geometry=gpd.points_from_xy(grid_df_rf.x, grid_df_rf.y))
#Save as ESRI shape file
grid_rf.to_file('Data_07/MLR_PRED_SOC.shp')
Converting point data to a raster is a little tricky in Python; you have to follow several steps to convert point data to a raster.
# Input point shape file
mlr_point = 'Data_07/MLR_PRED_SOC.shp'
# Output raster
mlr_output = 'Data_07/MLR_PRED_SOC.tif'
from osgeo import ogr, gdal
from osgeo import gdalconst
# Reference raster
point_ndsm = 'Data_07/DEM.tif'
point_data = gdal.Open(point_ndsm, gdalconst.GA_ReadOnly)
# Get geo-information
geo_transform = point_data.GetGeoTransform()
x_min = geo_transform[0]
y_max = geo_transform[3]
x_max = x_min + geo_transform[1] * point_data.RasterXSize
y_min = y_max + geo_transform[5] * point_data.RasterYSize
x_res = point_data.RasterXSize
y_res = point_data.RasterYSize
mb_v = ogr.Open(mlr_point) # change
mb_l = mb_v.GetLayer()
pixel_width = geo_transform[1]
# Raster conversion: the geotransform here starts at y_min with a positive pixel height,
# so the raster is written bottom-up (hence origin='lower' when the array is plotted later)
target_ds = gdal.GetDriverByName('GTiff').Create(mlr_output, x_res, y_res, 1, gdal.GDT_Float32)
target_ds.SetGeoTransform((x_min, pixel_width, 0, y_min, 0, pixel_width))
band = target_ds.GetRasterBand(1)
band.FlushCache()
gdal.RasterizeLayer(target_ds, [1], mb_l, options=["ATTRIBUTE=Pred_SOC"]) # Change
target_ds = None
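As an alternative sketch, the same rasterization can be done with rasterio (assuming rasterio is installed; the reference grid is again taken from DEM.tif, and the output name MLR_PRED_SOC_rio.tif is just for illustration):
import rasterio
from rasterio import features
# Use the DEM as the reference grid (extent, resolution, CRS)
with rasterio.open('Data_07/DEM.tif') as ref:
    meta = ref.meta.copy()
meta.update(count=1, dtype='float32', nodata=0)
# (geometry, value) pairs to burn into the raster
shapes = zip(grid_rf.geometry, grid_rf['Pred_SOC'])
burned = features.rasterize(shapes=shapes,
                            out_shape=(meta['height'], meta['width']),
                            transform=meta['transform'],
                            fill=0,
                            dtype='float32')
with rasterio.open('Data_07/MLR_PRED_SOC_rio.tif', 'w', **meta) as out:
    out.write(burned, 1)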
# Open GP state polygon data
import geopandas as gpd
fp_state="Data_07/GP_state.shp"
state_BD= gpd.GeoDataFrame.from_file(fp_state)
state_BD.geom_type.head()
from osgeo import ogr, gdal
from osgeo import gdalconst
import numpy as np
filename = "Data_07/MLR_PRED_SOC.tif"
gdal_data = gdal.Open(filename)
gdal_band = gdal_data.GetRasterBand(1)
nodataval = gdal_band.GetNoDataValue()
# convert to a numpy array
data_array = gdal_data.ReadAsArray().astype(float)
data_array
# Replace zeros with missing values if necessary
if np.any(data_array == 0):
    data_array[data_array == 0] = np.nan
%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (7,6))
plt.imshow(data_array,
cmap='jet',
origin='lower')
plt.colorbar(cax = fig.add_axes([1, 0.25, 0.05, 0.5]))
plt.title("MLR Predicted \n SOC (mg/g)")
plt.show()