
As discussed earlier, zero-dimensional points are essential for representing geographic features such as wells or soil-sampling locations that are best expressed by a single reference point. Point vector data often group several points into a multi-point structure with a single attribute record; for example, all of the soil-sampling points could be treated as a single geometry. The main drawback of point features, however, is that they cannot be used to make measurements (as you can with a polygon).

In this section, we will use 650 soil samples from Colorado, Kansas, New Mexico, and Wyoming. These samples were collected by the United States Geological Survey (USGS) as a part of the USGS Geochemical Landscapes Project [Smith et al., 2011]. We will learn how to join spatial and non-spatial data and create a dataframe for spatial interpolation.

This exercise consists of the following major steps:

Part 1: Create a Spatial Point Data Frame

Part 2: Data Visualization

Part 3: Exploratory Data Analysis

Part 4: Statistical Analysis

Data

We will use the following spatial data to create the data frame. The data can be downloaded from here.

  • Site ID file (GP_ID.csv): Contains FIPS codes and State and County names

  • Soil sampling locations (GP_GPS.CSV): This file contains coordinates of 473 soil sampling sites in Colorado, Kansas, New Mexico, and Wyoming. The soil samples were collected by the United States Geological Survey (USGS) from the A horizon with a sampling density of 1 sample per ~1,600 km² (Smith et al., 2011).

  • Soil organic carbon (SOC) (GP_SOC.csv): SOC concentration of these 473 sites. The concentration of SOC was predicted by means of mid-infrared (MIR) spectroscopy and partial least squares regression (PLSR) analysis described previously (Janik et al., 2007; Ahmed et al., 2017).

  • Environmental covariables (gp_point_data.csv)

  • Prediction grid data

Setting a consistent plotting style throughout the notebook

In [1]:
import seaborn as sns

sns.set_style("white")
sns.set(font_scale=1.5)

Part 1: Create a Spatial Point Data Frame

In [2]:
## Set working directory
In [3]:
import os

path= "E:\GitHub\geospatial-python-github.io\geospatial-python.io\Lesson_07_working_with_point_data"
os.chdir(path)
print("Current Working Directory " , os.getcwd())
Current Working Directory  E:\GitHub\geospatial-python-github.io\geospatial-python.io\Lesson_07_working_with_point_data

Load CSV Files as Pandas Data Frame

1. Site ID file

In [4]:
import pandas as pd

# Read ID file
f_id = 'Data_07/GP_ID.csv'
id_df = pd.read_csv(f_id)
id_df.head()
Out[4]:
SiteID FIPS STATE_ID STATE COUNTY
0 1 8001 8 Colorado Adams County
1 2 8003 8 Colorado Alamosa County
2 3 8005 8 Colorado Arapahoe County
3 4 8005 8 Colorado Arapahoe County
4 5 8005 8 Colorado Arapahoe County

2. Geo-coordinates of Soil Sampling Sites

In [5]:
import pandas as pd

# Read GPS of soil sampling locations
f_gps = 'Data_07/GP_GPS.csv'
gps = pd.read_csv(f_gps)
gps.head()
Out[5]:
SiteID long lat
0 1 -104.444250 39.95648
1 2 -105.927230 37.60194
2 3 -104.840678 39.64298
3 4 -104.504870 39.61835
4 5 -104.421550 39.61246

3. Soil Carbon Data

In [6]:
import pandas as pd

#  Soil OC data
f_soc = 'Data_07/GP_SOC.csv'
soc = pd.read_csv(f_soc)
soc.head()
Out[6]:
SiteID SOC
0 1 2.304
1 2 3.154
2 3 1.309
3 4 4.717
4 5 2.230

4. Environmental Covariables

In [7]:
import pandas as pd

f_co = 'Data_07/gp_point_data.csv'
co_var = pd.read_csv(f_co)
# Convert NLCD to integer
co_var['NLCD']=co_var['NLCD'].astype(int)
co_var.head()
Out[7]:
SiteID DEM Slope MAP MAT NDVI NLCD
0 1 1551.130005 1.30141 384.401001 9.79839 0.331707 7
1 2 2308.689941 0.85811 194.281998 5.36274 0.430466 7
2 3 1689.930054 2.96257 469.255005 10.05500 0.471263 6
3 4 1822.170044 2.71405 470.238007 9.39621 0.396961 6
4 5 1758.680054 1.79807 439.721985 9.52607 0.348821 6

5. NLCD ID File

In [8]:
import pandas as pd

f_nlcd = 'Data_07/NLCD_ID.csv'
nlcd_id = pd.read_csv(f_nlcd)
nlcd_id.head()
Out[8]:
NLCD_DES NLCD
0 Forest 4
1 Herbaceous 6
2 Planted/Cultivated 7
3 Shrubland 5

Convert Pandas Data Frame to Geo Data Frame

A GeoDataFrame needs a shapely geometry column. We use geopandas points_from_xy() to transform the longitude and latitude columns into a list of shapely.Point objects and set it as the geometry while creating the GeoDataFrame. (Note that points_from_xy() is an enhanced wrapper for [Point(x, y) for x, y in zip(df.Longitude, df.Latitude)].)

In [9]:
import geopandas as gpd

gps_point = gpd.GeoDataFrame(
    gps, geometry=gpd.points_from_xy(gps.long, gps.lat))
In [10]:
gps_point.head()
Out[10]:
SiteID long lat geometry
0 1 -104.444250 39.95648 POINT (-104.44425 39.95648)
1 2 -105.927230 37.60194 POINT (-105.92723 37.60194)
2 3 -104.840678 39.64298 POINT (-104.84068 39.64298)
3 4 -104.504870 39.61835 POINT (-104.50487 39.61835)
4 5 -104.421550 39.61246 POINT (-104.42155 39.61246)
In [11]:
# Plot data
gps_point.plot()
Out[11]:
<AxesSubplot:>

Define Map Projection

We will transform the point data from WGS 1984 to Albers Equal Area Conic NAD 1983 (EPSG:5070). We can reproject data from one CRS to another using the geopandas method (i.e., a function belonging to geopandas objects) to_crs(). The .to_crs() method will only work if your original spatial object has a CRS assigned to it AND if that CRS is the correct CRS for the data!

In [12]:
print(gps_point.crs)
None

We have to define the projection before changing the CRS to a projected coordinate system. We assign the WGS84 latitude-longitude CRS (EPSG:4326) to the crs attribute:

In [13]:
# Define CRS
gps_point.crs = ({'init' :'epsg:4326'})
E:\PyGeo\lib\site-packages\pyproj\crs\crs.py:53: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  return _prepare_from_string(" ".join(pjargs))
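The '+init=' syntax above is what triggers the deprecation warning. In recent versions of geopandas the same assignment can be written with set_crs(), which avoids the warning; a minimal sketch (assuming geopandas >= 0.7):

# Assign WGS84 (EPSG:4326) without the deprecated '+init=' syntax
gps_point = gps_point.set_crs(epsg=4326, allow_override=True)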

Now we will reproject the gps_point data using the CRS of the GP_STATE shapefile. First we extract the CRS from the State shapefile:

In [14]:
import geopandas as gpd

# US State shape files
fn="Data_07/GP_STATE.shp"
GP_STATE= gpd.read_file(fn)
# CRS of state polygon
GP_crs=GP_STATE.crs
GP_crs
Out[14]:
<Projected CRS: PROJCS["Albers_Conical_Equal_Area",GEOGCS["NAD83", ...>
Name: Albers_Conical_Equal_Area
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: unnamed
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
In [15]:
# Reproject to the EPSG:5070 projection
point_proj=gps_point.to_crs(GP_crs)
In [16]:
import matplotlib.pyplot as plt

# Make subplots that are next to each other
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 8))

# Plot the data in WGS84 CRS
gps_point.plot(ax=ax1, facecolor='gray');

# Add title
ax1.set_title("WGS84");

# Plot the reprojected data
point_proj.plot(ax=ax2, facecolor='blue');

# Add title
ax2.set_title("Albers Equal Area Conic");

# Remove empty white space around the plot
plt.tight_layout()

Merging Tabular and Spatial Data

Attribute joins are accomplished using the merge() function. In general, it is recommended to use the merge method called from the spatial dataset. The stand-alone merge() function will work if the GeoDataFrame is in the left argument; if a DataFrame is in the left argument and a GeoDataFrame is in the right position, the result will no longer be a GeoDataFrame.

First we will create a pandas DataFrame with all of the non-spatial data by chaining the pd.merge() function over the ID, SOC, and covariable tables with functools.reduce().

In [17]:
from functools import reduce

df1 = [id_df, soc,co_var]
df2 = reduce(lambda left,right: pd.merge(left,right,on='SiteID'), df1)
df2
Out[17]:
SiteID FIPS STATE_ID STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD
0 1 8001 8 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7
1 2 8003 8 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7
2 3 8005 8 Colorado Arapahoe County 1.309 1689.930054 2.96257 469.255005 10.05500 0.471263 6
3 4 8005 8 Colorado Arapahoe County 4.717 1822.170044 2.71405 470.238007 9.39621 0.396961 6
4 5 8005 8 Colorado Arapahoe County 2.230 1758.680054 1.79807 439.721985 9.52607 0.348821 6
... ... ... ... ... ... ... ... ... ... ... ... ...
565 566 56045 56 Wyoming Weston County 4.701 1296.369995 2.50288 315.881012 7.58175 0.345824 6
566 567 56045 56 Wyoming Weston County 4.701 1296.369995 2.50288 315.881012 7.58175 0.345824 6
567 568 56045 56 Wyoming Weston County 4.701 1296.369995 2.50288 315.881012 7.58175 0.345824 6
568 569 56045 56 Wyoming Weston County 3.105 1286.000000 2.30247 331.578003 7.55435 0.352597 6
569 570 56045 56 Wyoming Weston County 5.600 1417.859985 4.09692 434.790009 6.45444 0.489789 5

570 rows × 12 columns

Now we add the NLCD description column to the data frame:

In [18]:
from functools import reduce

d3 = [df2, nlcd_id]
df = reduce(lambda left,right: pd.merge(left,right,on='NLCD'), d3)
df
Out[18]:
SiteID FIPS STATE_ID STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD NLCD_DES
0 1 8001 8 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7 Planted/Cultivated
1 2 8003 8 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7 Planted/Cultivated
2 12 8009 8 Colorado Baca County 6.129 1221.199951 1.06476 443.065002 12.13140 0.366326 7 Planted/Cultivated
3 13 8009 8 Colorado Baca County 6.129 1221.199951 1.06476 443.065002 12.13140 0.366326 7 Planted/Cultivated
4 14 8009 8 Colorado Baca County 6.129 1221.199951 1.06476 443.065002 12.13140 0.366326 7 Planted/Cultivated
... ... ... ... ... ... ... ... ... ... ... ... ... ...
565 558 56043 56 Wyoming Washakie County 2.899 1380.719971 2.71461 246.516006 7.45715 0.171190 5 Shrubland
566 559 56043 56 Wyoming Washakie County 2.899 1380.719971 2.71461 246.516006 7.45715 0.171190 5 Shrubland
567 562 56043 56 Wyoming Washakie County 1.764 1421.709961 4.37551 290.014008 7.40410 0.212164 5 Shrubland
568 563 56043 56 Wyoming Washakie County 11.615 1631.560059 6.85482 378.261993 6.82645 0.298178 5 Shrubland
569 570 56045 56 Wyoming Weston County 5.600 1417.859985 4.09692 434.790009 6.45444 0.489789 5 Shrubland

570 rows × 13 columns

Now we will merge the GeoDataFrame (point_proj) with the pandas DataFrame (df) by the SiteID of each sampling location.

In [19]:
point_spdf = point_proj.merge(df, on='SiteID')
point_spdf.head()
Out[19]:
SiteID long lat geometry FIPS STATE_ID STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD NLCD_DES
0 1 -104.444250 39.95648 POINT (-714067.484 1913846.289) 8001 8 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7 Planted/Cultivated
1 2 -105.927230 37.60194 POINT (-866566.008 1663513.090) 8003 8 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7 Planted/Cultivated
2 3 -104.840678 39.64298 POINT (-750759.166 1881917.141) 8005 8 Colorado Arapahoe County 1.309 1689.930054 2.96257 469.255005 10.05500 0.471263 6 Herbaceous
3 4 -104.504870 39.61835 POINT (-722566.371 1876565.288) 8005 8 Colorado Arapahoe County 4.717 1822.170044 2.71405 470.238007 9.39621 0.396961 6 Herbaceous
4 5 -104.421550 39.61246 POINT (-715564.605 1875277.417) 8005 8 Colorado Arapahoe County 2.230 1758.680054 1.79807 439.721985 9.52607 0.348821 6 Herbaceous

Write as an ESRI Shapefile

In [20]:
point_spdf.to_file("Data_07/gp_point_all_data.shp")

Extract Geometry From Spatial Point Data Frame

Now we will extract the projected X and Y coordinates (in metres) from the spatial point data frame.

In [21]:
# Get X (m) and Y (m) Values from geometry column
point_spdf['x'] = point_spdf.geometry.x
point_spdf['y'] = point_spdf.geometry.y
point_spdf.head()
Out[21]:
SiteID long lat geometry FIPS STATE_ID STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD NLCD_DES x y
0 1 -104.444250 39.95648 POINT (-714067.484 1913846.289) 8001 8 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7 Planted/Cultivated -714067.484244 1.913846e+06
1 2 -105.927230 37.60194 POINT (-866566.008 1663513.090) 8003 8 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7 Planted/Cultivated -866566.008141 1.663513e+06
2 3 -104.840678 39.64298 POINT (-750759.166 1881917.141) 8005 8 Colorado Arapahoe County 1.309 1689.930054 2.96257 469.255005 10.05500 0.471263 6 Herbaceous -750759.166338 1.881917e+06
3 4 -104.504870 39.61835 POINT (-722566.371 1876565.288) 8005 8 Colorado Arapahoe County 4.717 1822.170044 2.71405 470.238007 9.39621 0.396961 6 Herbaceous -722566.371080 1.876565e+06
4 5 -104.421550 39.61246 POINT (-715564.605 1875277.417) 8005 8 Colorado Arapahoe County 2.230 1758.680054 1.79807 439.721985 9.52607 0.348821 6 Herbaceous -715564.604578 1.875277e+06

Convert Geo Data Frame to Pandas Data Frame

Now we will convert the geopandas data frame to a pandas data frame without the geometry column. We will use this data frame for further analysis.

In [22]:
import pandas as pd

point_df = pd.DataFrame(point_spdf.drop(columns=['geometry']))

# Organize variables
point_mf= point_df[['SiteID', 'long', 'lat', 'x', 'y', 'FIPS','STATE', 'COUNTY', 
    'SOC', 'DEM','Slope','MAP', 'MAT','NDVI','NLCD','NLCD_DES']]
point_mf.head()
Out[22]:
SiteID long lat x y FIPS STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD NLCD_DES
0 1 -104.444250 39.95648 -714067.484244 1.913846e+06 8001 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7 Planted/Cultivated
1 2 -105.927230 37.60194 -866566.008141 1.663513e+06 8003 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7 Planted/Cultivated
2 3 -104.840678 39.64298 -750759.166338 1.881917e+06 8005 Colorado Arapahoe County 1.309 1689.930054 2.96257 469.255005 10.05500 0.471263 6 Herbaceous
3 4 -104.504870 39.61835 -722566.371080 1.876565e+06 8005 Colorado Arapahoe County 4.717 1822.170044 2.71405 470.238007 9.39621 0.396961 6 Herbaceous
4 5 -104.421550 39.61246 -715564.604578 1.875277e+06 8005 Colorado Arapahoe County 2.230 1758.680054 1.79807 439.721985 9.52607 0.348821 6 Herbaceous

Write as a CSV File

In [23]:
point_mf.to_csv('Data_07/gp_all_data.csv', header =True, index=False)

Part 2: Data Visualization

Boxplot

Boxplot with Seaborn

In [24]:
import seaborn as sns

p, ax = plt.subplots(figsize=(10, 6))
ax=sns.boxplot(y="SOC", 
            x="NLCD_DES", 
            data=point_mf,
            palette="Blues_d")

ax.axes.xaxis.label.set_text("NLCD")
ax.axes.yaxis.label.set_text("Soil OC (mg/g)")
#plt.title('Mean ± SE of Soil OC')

Interactive Boxplot with Plotly

In [25]:
import plotly.express as px

fig = px.box(point_mf, 
             x="NLCD_DES", 
             y="SOC",
             width=800, 
             height=500)
fig.update_layout(
    title={
        'text': "Sourface Soil Organic (mg/g)",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
     xaxis_title="NLCD",
     yaxis_title="Soil OC",
     font=dict(
        family="sans-serif, monospace",
        size=18,
        color="#7f7f7f"))

fig.show()

Barplot

In [26]:
# Sort data by SOC in descending order
mf_soc = point_mf.sort_values("SOC", ascending=False)
In [27]:
import seaborn as sns

p, ax = plt.subplots(figsize=(10, 6))
ax=sns.barplot(y="SOC", 
            x="NLCD_DES", 
            data=point_mf,
            capsize=.2, 
            #width=.3,
            palette="Blues_d")

ax.axes.xaxis.label.set_text("NLCD")
ax.axes.yaxis.label.set_text("Soil OC (mg/g)")
#plt.title('Mean ± SE of Soil OC')
In [28]:
import seaborn as sns

p, ax = plt.subplots(figsize=(10, 7))
ax=sns.barplot(x="SOC", 
            y="NLCD_DES", 
            data=point_mf,
            capsize=.2, 
            #width=.3,
            palette="Blues_d")

ax.axes.yaxis.label.set_text("")
ax.axes.xaxis.label.set_text("Soil OC (mg/g)")

Scatter Plot

In [29]:
point_mf.plot(x='NDVI', y='SOC', style ='o')
Out[29]:
<AxesSubplot:xlabel='NDVI'>
In [30]:
import seaborn as sns

sns.jointplot(x=point_mf["NDVI"], y=point_mf["SOC"], kind='scatter')
sns.jointplot(x=point_mf["NDVI"], y=point_mf["SOC"], kind='hex')
sns.jointplot(x=point_mf["NDVI"], y=point_mf["SOC"], kind='kde')
Out[30]:
<seaborn.axisgrid.JointGrid at 0x1d5af9b8148>

3D Plot

In [31]:
import plotly.express as px

fig = px.scatter_3d(point_mf, x='DEM', y='NDVI', z='SOC',
                    color='SOC',
                    size='SOC', size_max=18,
                    opacity=0.7)

# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))

Map Spatial Distribution of SOC

We will create a quantile map to see the spatial distribution of SOC in the four states.

In [32]:
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap, BoundaryNorm
from matplotlib import cm
from mpl_toolkits.axes_grid1 import make_axes_locatable

## Plot Soil C
# set the range for the choropleth
vmin, vmax = 0, 32
# create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(6, 7))
# State boundary
GP_STATE.plot(ax=ax, color="white", edgecolor="gray")
# Point data
point_spdf.plot(column='SOC', 
              cmap='Spectral', 
              ax=ax,
              alpha=1, 
              scheme='quantiles',
              k =7,
              markersize=30, 
              )
# remove the axis
#ax.axis('off')
# add a title
ax.set_title('Soil Organic (%)',
            fontdict={'fontsize': '20', 'fontweight' : '3'})
# create an axes on the right side of ax. The width of cax will be 5%
# of ax and the padding between cax and ax will be fixed at 0.05 inch.
# You need to import mpl_toolkits.axes_grid1 
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
# Create colorbar as a legend
sm = plt.cm.ScalarMappable(cmap='Spectral',
                           norm=plt.Normalize(vmin=vmin, vmax=vmax))
# empty array for the data range
sm._A = []
# add the colorbar to the figure
cbar = fig.colorbar(sm, cax=cax)

Interactive Map with Plotly

In [33]:
import plotly.graph_objects as go


point_spdf['text'] = point_spdf['STATE'] + ', ' + point_spdf['COUNTY'] + ' - SOC: ' + point_spdf['SOC'].astype(str)


fig = go.Figure(data=go.Scattergeo(
        locationmode = 'USA-states',
        lon = point_spdf['long'],
        lat = point_spdf['lat'],
        text = point_spdf['text'],
        mode = 'markers',
        marker = dict(
            size = 6,
            opacity = 1,
            reversescale = False,
            autocolorscale = False,
            symbol = 'circle-open',
            line = dict(
                width=0.8,
                color='rgba(102, 102, 102)'
            ),
            colorscale = "YlGnBu",
            cmin = 0,
            color = point_spdf['SOC'],
            cmax = point_spdf['SOC'].max(),
            colorbar_title="Soil OC <br> (mg/g) "
        )))

fig.update_layout(
        title={
        'text': "Spatial Distribution of Soil Organic C (mg/g)",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        geo = dict(
            scope='usa',
            projection_type='albers usa',
            showland = True,
            landcolor = "rgb(250, 250, 250)",
            subunitcolor = "rgb(217, 217, 217)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5))
        

fig.show()

Part 3: Exploratory Data Analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task (Wikipedia, 2020).

Distribution of SOC

In [34]:
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
from scipy.stats import norm 

fig = plt.figure(figsize=(7, 6))
sns.distplot(point_mf['SOC'], fit=norm);


# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(point_mf['SOC'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SOC distribution')

#Get also the QQ-plot
fig = plt.figure(figsize=(7, 6))
res =st.probplot(point_mf['SOC'], plot=plt)
plt.show()
 mu = 6.11 and sigma = 5.01

You can create a nice-looking Q-Q plot with Pingouin, an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy.

In [35]:
import numpy as np
import pingouin as pg

ax = pg.qqplot(point_mf['SOC'], dist='norm')

Normality Test

The pingouin.normality() function works with lists, arrays, or pandas DataFrame in wide or long-format.

In [36]:
import numpy as np
import pingouin as pg

print(pg.normality(point_mf['SOC']))
            W          pval  normal
SOC  0.866947  1.307438e-21   False

Skewness of SOC

In [37]:
from scipy.stats import skew 

skew(point_mf['SOC'])
Out[37]:
1.5317088546155244

Box-Cox Transformation

In [38]:
import scipy.stats as st

# Box-Cox transform of SOC; lmbda=0 corresponds to a log transform
point_mf['SOC_BC'] = st.boxcox(point_mf["SOC"], lmbda=0)
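If you would rather let SciPy estimate the Box-Cox λ by maximum likelihood instead of fixing it at zero, a minimal sketch (assuming the same point_mf data frame) looks like this:

import scipy.stats as st

# boxcox() without lmbda returns the transformed data and the fitted lambda
soc_bc, fitted_lambda = st.boxcox(point_mf["SOC"])
print("Fitted Box-Cox lambda: {:.3f}".format(fitted_lambda))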
In [39]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(7, 6))
sns.distplot(point_mf['SOC_BC'], fit=norm);

Descriptive Statistics

In [40]:
point_mf[['SOC', 'DEM', 'Slope','MAP', 'MAT','NDVI']].describe()
Out[40]:
SOC DEM Slope MAP MAT NDVI
count 570.000000 570.000000 570.000000 570.000000 570.000000 570.000000
mean 6.110298 1627.776373 4.780463 497.145379 8.846380 0.432971
std 5.010023 750.175428 4.715820 208.120256 4.061732 0.162786
min 0.152000 258.648987 0.649253 194.281998 -0.849688 0.153098
25% 2.569750 1177.099976 1.370800 357.262756 6.008715 0.305333
50% 4.701000 1595.719971 2.647395 432.712509 9.099600 0.413890
75% 8.625500 2230.392456 6.852565 580.362259 12.327850 0.550799
max 30.473000 3618.020020 26.104200 1128.109985 17.018200 0.796992

Summary Statistics by Group

In [41]:
# Descriptive statistics
soc_stat=point_mf.groupby("NLCD_DES")['SOC'].describe().reset_index()
soc_stat['count_squareroot']=soc_stat['count']**(1/2)
# Standard error (computed here as mean / sqrt(n))
soc_stat['sem'] = soc_stat['mean']/soc_stat['count_squareroot']
# Sort by mean in ascending order
mf_soc_stat = soc_stat.sort_values("mean", ascending=True)
mf_soc_stat
Out[41]:
NLCD_DES count mean std min 25% 50% 75% max count_squareroot sem
3 Shrubland 160.0 3.703113 3.511537 0.244 1.1275 2.8555 4.90525 19.099 12.649111 0.292757
1 Herbaceous 185.0 5.184076 3.760195 0.152 2.3520 4.5170 6.53600 18.814 13.601471 0.381141
2 Planted/Cultivated 111.0 7.199550 3.482110 0.462 4.3100 7.0960 9.78550 15.740 10.535654 0.683351
0 Forest 114.0 9.931298 6.979229 0.736 4.4470 7.7185 14.26750 30.473 10.677078 0.930151
In [42]:
import numpy as np

# Create lists for the plot
NLCD = mf_soc_stat['NLCD_DES']
x_pos = np.arange(len(NLCD))
SOC = mf_soc_stat['mean']
error =mf_soc_stat ['sem']
In [43]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(NLCD, SOC, yerr=error, align='center', alpha=0.5, ecolor='black', capsize=10)
ax.set_ylabel('Soil SOC (mg/g)')
ax.set_xticks(x_pos)
ax.set_xticklabels(NLCD)
ax.set_title('Mean ± SE of  Soil OC')
ax.yaxis.grid(True)

# Save the figure and show
plt.tight_layout()
# plt.savefig('bar_plot_with_error_bars.png')
plt.show()

Part 4: Statistical Analysis

Analysis of Variance

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the "variation" among and between groups) used to analyze the differences among group means in a sample (Source: https://en.wikipedia.org/wiki/Analysis_of_variance).

In [44]:
import numpy as np
import pingouin as pg

aov = pg.anova(data=df, dv='SOC', between='NLCD_DES', detailed=True)
In [45]:
aov
Out[45]:
Source SS DF MS F p-unc np2
0 NLCD_DES 2881.938907 3 960.646302 47.694621 1.713925e-27 0.201787
1 Within 11400.149410 566 20.141607 NaN NaN NaN
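Since the ANOVA indicates significant differences in mean SOC among the NLCD classes, a natural follow-up is a post-hoc comparison of the group means. A minimal sketch using Pingouin's Tukey HSD test (assuming the same df used above):

import pingouin as pg

# Pairwise Tukey HSD comparisons of mean SOC between NLCD classes
posthoc = pg.pairwise_tukey(data=df, dv='SOC', between='NLCD_DES')
print(posthoc)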

Correlation Analysis

In [46]:
mf=point_mf[['SOC', 'DEM', 'Slope','MAP', 'MAT','NDVI']]
In [47]:
mf.corr()
Out[47]:
SOC DEM Slope MAP MAT NDVI
SOC 1.000000 0.127140 0.349538 0.518146 -0.315259 0.597899
DEM 0.127140 1.000000 0.691905 -0.263876 -0.803150 -0.071148
Slope 0.349538 0.691905 1.000000 0.181212 -0.625302 0.298365
MAP 0.518146 -0.263876 0.181212 1.000000 0.025930 0.814111
MAT -0.315259 -0.803150 -0.625302 0.025930 1.000000 -0.206880
NDVI 0.597899 -0.071148 0.298365 0.814111 -0.206880 1.000000

You can also use the Pingouin package for correlation analysis.

In [48]:
import numpy as np
import pingouin as pg

pg.pairwise_corr(mf, columns=['SOC', 'DEM', 'Slope','MAP', 'MAT','NDVI'], method='pearson')
Out[48]:
X Y method tail n r CI95% r2 adj_r2 z p-unc BF10 power
0 SOC DEM pearson two-sided 570 0.127140 [0.05, 0.21] 0.016165 0.012694 0.127832 2.356984e-03 5.284 0.861341
1 SOC Slope pearson two-sided 570 0.349538 [0.28, 0.42] 0.122177 0.119080 0.364917 8.005180e-18 5.447e+14 1.000000
2 SOC MAP pearson two-sided 570 0.518146 [0.46, 0.58] 0.268475 0.265895 0.573802 1.774551e-40 1.393e+37 1.000000
3 SOC MAT pearson two-sided 570 -0.315259 [-0.39, -0.24] 0.099388 0.096211 -0.326374 1.282272e-14 3.856e+11 1.000000
4 SOC NDVI pearson two-sided 570 0.597899 [0.54, 0.65] 0.357483 0.355217 0.689871 1.533655e-56 1.23e+53 1.000000
5 DEM Slope pearson two-sided 570 0.691905 [0.65, 0.73] 0.478733 0.476894 0.851602 2.132024e-82 6.216e+78 1.000000
6 DEM MAP pearson two-sided 570 -0.263876 [-0.34, -0.19] 0.069631 0.066349 -0.270270 1.554746e-10 3.897e+07 0.999996
7 DEM MAT pearson two-sided 570 -0.803150 [-0.83, -0.77] 0.645050 0.643798 -1.107424 7.360595e-130 1.059e+126 1.000000
8 DEM NDVI pearson two-sided 570 -0.071148 [-0.15, 0.01] 0.005062 0.001553 -0.071268 8.968713e-02 0.221 0.396904
9 Slope MAP pearson two-sided 570 0.181212 [0.1, 0.26] 0.032838 0.029426 0.183236 1.342990e-05 666.436 0.991953
10 Slope MAT pearson two-sided 570 -0.625302 [-0.67, -0.57] 0.391003 0.388854 -0.733664 3.614999e-63 4.732e+59 1.000000
11 Slope NDVI pearson two-sided 570 0.298365 [0.22, 0.37] 0.089022 0.085808 0.307724 3.488650e-13 1.512e+10 1.000000
12 MAP MAT pearson two-sided 570 0.025930 [-0.06, 0.11] 0.000672 -0.002853 0.025936 5.366929e-01 0.063 0.094749
13 MAP NDVI pearson two-sided 570 0.814111 [0.78, 0.84] 0.662777 0.661588 1.139101 3.482758e-136 2.098e+132 1.000000
14 MAT NDVI pearson two-sided 570 -0.206880 [-0.28, -0.13] 0.042799 0.039423 -0.209910 6.281647e-07 1.248e+04 0.998827
In [49]:
import matplotlib.pyplot as plt

corrmat = mf.corr() 
f, ax = plt.subplots(figsize =(9, 8)) 
sns.heatmap(corrmat, ax = ax, cmap ="YlGnBu", linewidths = 0.1) 
Out[49]:
<AxesSubplot:>

Grid Correlation Matrix

In [50]:
cg = sns.clustermap(corrmat, cmap ="YlGnBu", linewidths = 0.1); 
plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation = 0) 
  
cg 
Out[50]:
<seaborn.matrix.ClusterGrid at 0x1d5b3c643c8>

Correlation Matrix Plot With Pandas

In [51]:
pd.plotting.scatter_matrix(point_mf[['SOC', 'DEM', 'MAT','NDVI']], alpha=0.2, figsize=(10, 10))
plt.show()

Correlation Matrix Plot With Seaborn

Seaborn makes it easy to build a correlogram or correlation matrix. Correlograms are great for exploratory analysis: they let you quickly observe the relationship between every pair of variables in your matrix. With seaborn, you just call the pairplot function.

In [52]:
sns.pairplot(point_mf[['SOC', 'DEM', 'MAT','NDVI']], diag_kind="kde")
Out[52]:
<seaborn.axisgrid.PairGrid at 0x1d5b45cc248>

Regression Analysis

"Regression is a fundamental problem in statistics and machine learning. In regression studies, we are typically interested in inferring a real-valued function (called a regression function) whose values correspond to the mean of a dependent (or response or output) variable conditioned on one or more independent (or input) variables. Many different techniques for estimating this regression function have been developed, including parametric, semi-parametric, and nonparametric methods." (Source: Quadrianto N., Buntine W.L. (2011) Regression. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_710)

Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model (Wikipedia).

We will use "scikit-learn", a machine learning library for the Python programming language.It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy (Wikipedia).

Simple Linear Regression (SLR)

Simple linear regression summarizes and studies the relationship between two continuous (quantitative) variables. One variable, denoted x, is regarded as the predictor, explanatory, or independent variable. The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Separate the target value, or "y", from the features. This label is the value that you will train the model to predict.

In [53]:
x = mf['NDVI'].values.reshape(-1,1)
y = mf['SOC'].values.reshape(-1,1)

Split the data into train and test

In [54]:
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42)
In [55]:
print(x_train.shape)
print(y_test.shape)
print(y_train.shape)
print(y_test.shape)
(427, 1)
(143, 1)
(427, 1)
(143, 1)

Standardize features

It is good practice to normalize or standardize features that use different scales and ranges. Although the model might converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

In [56]:
from sklearn import preprocessing

x_train = preprocessing.scale(x_train)
x_test = preprocessing.scale(x_test)
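Note that preprocessing.scale() standardizes the training and test sets independently. A common alternative is to fit a StandardScaler on the training features only and reuse it for the test features, so both sets share the same scaling parameters. A hedged sketch of that variant (with new variable names so it does not overwrite the arrays scaled above):

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply the same
# transformation to the test features
scaler = StandardScaler().fit(x_train)
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)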

Fit SLR

In [57]:
from sklearn.linear_model import LinearRegression

slr = LinearRegression()  
slr=slr.fit(x_train, y_train) #training the algorithm
In [58]:
print('Intercept:', slr.intercept_) 
print('Slope:', slr.coef_) 
Intercept: [6.16803279]
Slope: [[3.0336134]]

Calculate Residuals

In [59]:
y_train_pred=slr.predict(x_train)
train_pred= pd.DataFrame({'Actual': y_train.flatten(), 'Predicted': y_train_pred.flatten()})
train_pred['residuals']=train_pred['Actual'] - train_pred['Predicted']
train_pred
Out[59]:
Actual Predicted residuals
0 1.823 7.268126 -5.445126
1 11.220 8.201743 3.018257
2 8.328 6.759348 1.568652
3 8.780 11.474975 -2.694975
4 10.720 5.046893 5.673107
... ... ... ...
422 6.448 5.922508 0.525492
423 6.361 6.476301 -0.115301
424 2.335 6.719013 -4.384013
425 3.134 4.764200 -1.630200
426 2.821 5.431658 -2.610658

427 rows × 3 columns

Prediction at test locations

Now that we have trained our algorithm, it’s time to make some predictions. To do so, we will use our test data and see how accurately SLR predicts the SOC.

In [60]:
y_slr_pred = slr.predict(x_test)
In [61]:
slr_pred = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': y_slr_pred.flatten()})
slr_pred.describe()
Out[61]:
Observed Predicted
count 143.000000 143.000000
mean 5.937902 6.168033
std 4.371507 3.044276
min 0.152000 1.104464
25% 2.769500 3.881187
50% 4.701000 5.921257
75% 8.550000 8.180455
max 21.591000 12.142558
In [62]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=slr_pred)
ax.set_ylim(0,40)
ax.set_xlim(0,40)
ax.plot([0, 40], [0, 40], 'k--', lw=2)
plt.title('SLR Predicted SOC (%)',fontsize=18)
Out[62]:
Text(0.5, 1.0, 'SLR Predicted SOC (%)')
In [63]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_slr_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_slr_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_slr_pred)))
Mean Absolute Error: 2.4996874518375476
Mean Squared Error: 10.842598955878769
Root Mean Squared Error: 3.292810191292351
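For completeness, the test-set coefficient of determination (R²) can be reported the same way; a one-line sketch reusing the metrics module already imported above:

print('R2:', metrics.r2_score(y_test, y_slr_pred))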

Multiple Linear Regression (MLR)

Multiple linear regression (MLR) explores the relationships between several explanatory variables and a response variable. The goal of MLR is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.

Create a data frame

In [64]:
Xm =mf[['DEM', 'Slope', 'MAT', 'MAP','NDVI']].values
y = mf[['SOC']].values

Split to training and test data

In [65]:
x_train, x_test, y_train, y_test = train_test_split(Xm, y, test_size=0.2, random_state=0)
In [66]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(456, 5)
(114, 5)
(456, 1)
(114, 1)

Fit MLR

In [67]:
from sklearn.linear_model import LinearRegression

mlr = LinearRegression()  
mlr.fit(x_train, y_train)
Out[67]:
LinearRegression()

Coefficients of the MLR Model

In [68]:
pred_mlr = pd.DataFrame(['DEM', 'Slope', 'MAT', 'MAP','NDVI'])
coeff = pd.DataFrame(mlr.coef_, index=['Co-efficient']).transpose() 
In [69]:
pd.concat([pred_mlr,coeff], axis=1, join='inner')
Out[69]:
0 Co-efficient
0 DEM -0.000062
1 Slope 0.076634
2 MAT -0.278529
3 MAP 0.005548
4 NDVI 10.528183

Training Statistics

In [70]:
import sklearn.metrics as metrics

y_train_mlr=mlr.predict(x_train)
In [71]:
explained_variance=metrics.explained_variance_score(y_train, y_train_mlr)
mean_absolute_error=metrics.mean_absolute_error(y_train, y_train_mlr) 
mse = metrics.mean_squared_error(y_train, y_train_mlr) 
median_absolute_error = metrics.median_absolute_error(y_train, y_train_mlr)
r2 = metrics.r2_score(y_train, y_train_mlr)
In [72]:
print('explained_variance: ', round(explained_variance,4))    
print('r2: ', round(r2,4))
print('MAE: ', round(mean_absolute_error,4))
print('MSE: ', round(mse,4))
print('RMSE: ', round(np.sqrt(mse),4))
explained_variance:  0.4064
r2:  0.4064
MAE:  2.7994
MSE:  15.4298
RMSE:  3.9281

MLR with Pingouin

In [73]:
import numpy as np
import pingouin as pg

pg.linear_regression(mf[['DEM', 'Slope', 'MAT', 'MAP','NDVI']], mf['SOC'])
Out[73]:
names coef se T pval r2 adj_r2 CI[2.5%] CI[97.5%]
0 Intercept 0.894583 1.615034 0.553910 5.798601e-01 0.416187 0.411011 -2.277632 4.066798
1 DEM -0.000133 0.000492 -0.271273 7.862806e-01 0.416187 0.411011 -0.001100 0.000833
2 Slope 0.094482 0.055896 1.690318 9.151970e-02 0.416187 0.411011 -0.015308 0.204272
3 MAT -0.257827 0.074714 -3.450864 6.006176e-04 0.416187 0.411011 -0.404579 -0.111076
4 MAP 0.005239 0.001478 3.545001 4.251981e-04 0.416187 0.411011 0.002336 0.008141
5 NDVI 10.757545 1.882793 5.713611 1.792271e-08 0.416187 0.411011 7.059403 14.455687

Assessing the Quality of Regression Models

K-fold Cross Validation

"Cross-validation is a process for creating a distribution of pairs of training and test sets out of a single data set. In cross validation the data are partitioned into k subsets, S1…Sk, each called a fold. The folds are usually of approximately the same size. The learning algorithm is then applied k times, for i = 1 to k, each time using the union of all subsets other than Si as the training set and using Si as the test set". (Source: (2011) Cross-Validation. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_190)

In k-fold cross-validation the data are split into k different subsets (or folds). For each fold, the model is trained on the remaining k-1 folds and evaluated on the held-out fold; the k scores are then averaged. We will use the entire data set (570 samples) in this tutorial.

In [74]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_validate
import sklearn.metrics as metrics

cv_Kfold = KFold(n_splits=10, random_state=1, shuffle=True)

mlr = LinearRegression()
mlr.fit(Xm, y)
cv_scores_Kfold = cross_val_score(mlr, Xm, y, scoring="neg_mean_squared_error", cv=cv_Kfold, n_jobs=1)
In [75]:
all_scores = pd.DataFrame(cross_validate(mlr, Xm, y, scoring=['r2', 'neg_mean_absolute_error'], n_jobs=-1, verbose=0))
print(all_scores)
   fit_time  score_time   test_r2  test_neg_mean_absolute_error
0  0.006001    0.002002  0.310929                     -3.639326
1  0.003999    0.002001  0.265494                     -3.101695
2  0.005001    0.002002  0.326416                     -2.331438
3  0.003999    0.002000  0.322108                     -1.805035
4  0.006998    0.000999  0.483584                     -2.947291
In [76]:
# mean and STD
print("Folds: " + str(len(cv_scores_Kfold)) + ", MSE: " + str(np.mean(np.abs(cv_scores_Kfold))) + ", STD: " + str(np.std(cv_scores_Kfold)))
Folds: 10, MSE: 15.076049305815697, STD: 3.0349086993795664
In [77]:
# Cross-validation prediction
y_pred_KFold = cross_val_predict(mlr, Xm, y, cv=10)
In [78]:
Kfold_pred = pd.DataFrame({'Observed': y.flatten(), 'Predicted': y_pred_KFold.flatten()})
Kfold_pred.describe()
Out[78]:
Observed Predicted
count 570.000000 570.000000
mean 6.110298 6.114798
std 5.010023 3.278569
min 0.152000 -0.206935
25% 2.569750 3.728761
50% 4.701000 5.416715
75% 8.625500 8.582460
max 30.473000 14.906191
In [79]:
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=Kfold_pred)
ax.set_ylim(0,35)
ax.set_xlim(0,35)
ax.plot([0, 35], [0, 35], 'k--', lw=2)
plt.title('K-Fold Cross-Validation',fontsize=18)
Out[79]:
Text(0.5, 1.0, 'K-Fold Cross-Validation')
In [80]:
import sklearn.metrics as metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_pred_KFold))  
print('Mean Squared Error:', metrics.mean_squared_error(y, y_pred_KFold))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, y_pred_KFold)))
print('R2:', metrics.r2_score(y, y_pred_KFold))
Mean Absolute Error: 2.7864972893328037
Mean Squared Error: 15.407817905098318
Root Mean Squared Error: 3.9252793410276317
R2: 0.3850719859175731

Leave-One-Out Cross-Validation

Leave-one-out cross-validation is a special case of cross-validation where the number of folds equals the number of instances in the data set. Thus, the learning algorithm is applied once for each instance, using all other instances as a training set and using the selected instance as a single-item test set. This process is closely related to the statistical method of jack-knife estimation (Efron, 1982). (Source:(2011) Leave-One-Out Cross-Validation. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_469).

In [81]:
cv_LOV = KFold(n_splits=570, random_state=None, shuffle=False)
cv_scores_LOV = cross_val_score(mlr, Xm, y, scoring="neg_mean_squared_error", cv=cv_LOV, n_jobs=1)
In [82]:
# Mean and SD 
print("Folds: " + str(len(cv_scores_LOV)) + ", MSE: " + str(np.mean(np.abs(cv_scores_LOV))) + ", STD: " + str(np.std(cv_scores_LOV)))
Folds: 570, MSE: 15.002241515906002, STD: 32.7190311755415
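scikit-learn also ships a dedicated LeaveOneOut splitter, which is equivalent to KFold with n_splits equal to the number of samples. A minimal sketch (assuming the same mlr, Xm and y) that would give true leave-one-out predictions:

from sklearn.model_selection import LeaveOneOut, cross_val_predict

# LeaveOneOut() == KFold(n_splits=len(y)); one model fit per sample
loo = LeaveOneOut()
y_pred_loo = cross_val_predict(mlr, Xm, y, cv=loo)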
In [83]:
# Cross-validation prediction (note: cv=10, i.e. 10-fold CV, is used here rather than the leave-one-out splitter defined above)
y_pred_LOV = cross_val_predict(mlr, Xm, y, cv=10)
In [84]:
LOV_pred = pd.DataFrame({'Observed': y.flatten(), 'Predicted': y_pred_LOV.flatten()})
LOV_pred.describe()
Out[84]:
Observed Predicted
count 570.000000 570.000000
mean 6.110298 6.114798
std 5.010023 3.278569
min 0.152000 -0.206935
25% 2.569750 3.728761
50% 4.701000 5.416715
75% 8.625500 8.582460
max 30.473000 14.906191
In [85]:
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=LOV_pred)
ax.set_ylim(0,35)
ax.set_xlim(0,35)
ax.plot([0, 35], [0, 35], 'k--', lw=2)
plt.title('LOV Cross-Validation',fontsize=18)
Out[85]:
Text(0.5, 1.0, 'LOV Cross-Validation')

Prediction at test locations

Now that we have trained our algorithm, it’s time to make some predictions. To do so, we will use our test data and see how accurately MLR predicts the SOC.

In [86]:
y_mlr_pred = mlr.predict(x_test)
In [87]:
mlr_pred = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': y_mlr_pred.flatten()})
mlr_pred.describe()
Out[87]:
Observed Predicted
count 114.000000 114.000000
mean 5.729605 5.679312
std 4.616657 3.217073
min 0.152000 -0.032682
25% 2.356500 3.560598
50% 4.562000 5.229532
75% 7.782000 7.728301
max 24.954000 13.828552
In [88]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=mlr_pred)
ax.set_ylim(0,30)
ax.set_xlim(0,30)
ax.plot([0, 30], [0, 30], 'k--', lw=2)
plt.title('MLR Predicted SOC (%)',fontsize=18)
Out[88]:
Text(0.5, 1.0, 'MLR Predicted SOC (%)')
In [89]:
import sklearn.metrics as metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_mlr_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_mlr_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_mlr_pred)))
print('R2:', metrics.r2_score(y_test, y_mlr_pred))
Mean Absolute Error: 2.4324964664504187
Mean Squared Error: 11.391382213714996
Root Mean Squared Error: 3.375112177945349
R2: 0.4608027564669763

Prediction at Grid Locations

Now we will predict SOC at grid locations across the four US states.

In [90]:
import pandas as pd

g_id = 'Data_07/gp_grid_data.csv'
grid_df = pd.read_csv(g_id)
grid_df.head()
Out[90]:
GridID x y DEM Slope MAP MAT NDVI NLCD
0 16 -1.210355e+06 1.001813e+06 1670.989990 7.23222 453.648987 14.2848 0.288304 5.0
1 17 -1.205355e+06 1.001813e+06 1670.989990 7.23222 453.648987 14.2848 0.288304 5.0
2 18 -1.200355e+06 1.001813e+06 1563.839966 4.70358 402.542999 14.9359 0.245139 5.0
3 19 -1.195355e+06 1.001813e+06 1563.839966 4.70358 402.542999 14.9359 0.245139 5.0
4 20 -1.190355e+06 1.001813e+06 1428.630005 2.02707 348.114990 15.6579 0.218998 6.0
In [91]:
x_grid=grid_df[['DEM','Slope','MAP','MAT','NDVI']]
x_grid = preprocessing.scale(x_grid)
In [92]:
mlr_grid = mlr.predict(x_grid)
mlr_grid 
Out[92]:
array([[-6.60331576],
       [-6.60331576],
       [-9.42666201],
       ...,
       [ 9.38705417],
       [ 9.38705417],
       [13.35855728]])
In [93]:
grid_df['Pred_SOC'] = pd.DataFrame(mlr_grid)
grid_df
Out[93]:
GridID x y DEM Slope MAP MAT NDVI NLCD Pred_SOC
0 16 -1.210355e+06 1.001813e+06 1670.989990 7.23222 453.648987 14.284800 0.288304 5.0 -6.603316
1 17 -1.205355e+06 1.001813e+06 1670.989990 7.23222 453.648987 14.284800 0.288304 5.0 -6.603316
2 18 -1.200355e+06 1.001813e+06 1563.839966 4.70358 402.542999 14.935900 0.245139 5.0 -9.426662
3 19 -1.195355e+06 1.001813e+06 1563.839966 4.70358 402.542999 14.935900 0.245139 5.0 -9.426662
4 20 -1.190355e+06 1.001813e+06 1428.630005 2.02707 348.114990 15.657900 0.218998 6.0 -11.128264
... ... ... ... ... ... ... ... ... ... ...
41680 42008 -1.180355e+06 2.531813e+06 2482.820068 10.74880 1091.209961 1.427390 0.594469 5.0 12.825463
41681 42009 -1.175355e+06 2.531813e+06 2482.820068 10.74880 1091.209961 1.427390 0.594469 5.0 12.825463
41682 42010 -1.170355e+06 2.531813e+06 2693.919922 15.65970 1066.500000 0.205696 0.540465 4.0 9.387054
41683 42011 -1.165355e+06 2.531813e+06 2693.919922 15.65970 1066.500000 0.205696 0.540465 4.0 9.387054
41684 42012 -1.160355e+06 2.531813e+06 2214.010010 14.72760 552.973999 2.368980 0.591660 5.0 13.358557

41685 rows × 10 columns

Convert to SPDF

In [94]:
import geopandas as gpd

# create data-frame
grid_df_rf = grid_df[['GridID','x','y','Pred_SOC']]
# Convert to spatial point data-frame
grid_rf = gpd.GeoDataFrame(
    grid_df_rf, geometry=gpd.points_from_xy(grid_df_rf.x, grid_df_rf.y))
#Save as ESRI shape file
grid_rf.to_file('Data_07/MLR_PRED_SOC.shp')

Convert to Raster

Converting points to a raster is a little tricky in Python. You have to follow several steps to convert point data to a raster.

In [95]:
# Input  point shape file
mlr_point = 'Data_07/MLR_PRED_SOC.shp'
In [96]:
# Output raster
mlr_output = 'Data_07/MLR_PRED_SOC.tif'
In [97]:
from osgeo import ogr, gdal
from osgeo import gdalconst

# Reference raster
point_ndsm = 'Data_07/DEM.tif'
point_data = gdal.Open(point_ndsm, gdalconst.GA_ReadOnly)

# Get geo-information
geo_transform = point_data.GetGeoTransform()
x_min = geo_transform[0]
y_max = geo_transform[3]
x_max = x_min + geo_transform[1] * point_data.RasterXSize
y_min = y_max + geo_transform[5] * point_data.RasterYSize
x_res = point_data.RasterXSize
y_res = point_data.RasterYSize
mb_v = ogr.Open(mlr_point) # change 
mb_l = mb_v.GetLayer()
pixel_width = geo_transform[1]

# Raster conversion
target_ds = gdal.GetDriverByName('GTiff').Create(mlr_output, x_res, y_res, 1, gdal.GDT_Float32) # Change
target_ds.SetGeoTransform((x_min, pixel_width, 0, y_min, 0, pixel_width))
band = target_ds.GetRasterBand(1)
band.FlushCache()
gdal.RasterizeLayer(target_ds, [1], mb_l, options=["ATTRIBUTE=Pred_SOC"]) # Change 

target_ds = None
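As an aside, the rasterio package offers a more compact route from point geometries to a GeoTIFF. The sketch below is an untested alternative that assumes rasterio is installed and reuses DEM.tif as the template grid:

import rasterio
from rasterio import features

# Use the DEM raster as a template for extent, resolution and CRS
with rasterio.open('Data_07/DEM.tif') as ref:
    meta = ref.meta.copy()
meta.update(count=1, dtype='float32', nodata=-9999)

# Burn the predicted SOC values into the template grid
shapes = ((geom, val) for geom, val in zip(grid_rf.geometry, grid_rf['Pred_SOC']))
burned = features.rasterize(shapes=shapes,
                            out_shape=(meta['height'], meta['width']),
                            transform=meta['transform'],
                            fill=meta['nodata'],
                            dtype='float32')

with rasterio.open('Data_07/MLR_PRED_SOC_rio.tif', 'w', **meta) as out:
    out.write(burned, 1)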

Visualize the Predicted SOC Raster

In [98]:
# Open GP state boundary polygon data

import geopandas as gpd

fp_state="Data_07/GP_state.shp"
state_BD= gpd.GeoDataFrame.from_file(fp_state)
state_BD.geom_type.head()
Out[98]:
0    Polygon
1    Polygon
2    Polygon
3    Polygon
dtype: object
In [99]:
from osgeo import ogr, gdal
from osgeo import gdalconst
import numpy as np

filename = "Data_07/MLR_PRED_SOC.tif"
gdal_data = gdal.Open(filename)
gdal_band = gdal_data.GetRasterBand(1)
nodataval = gdal_band.GetNoDataValue()

# convert to a numpy array (np.float is deprecated, so use the builtin float)
data_array = gdal_data.ReadAsArray().astype(float)

# replace zeros with missing values if necessary
if np.any(data_array == 0):
    data_array[data_array == 0] = np.nan
In [100]:
%matplotlib inline
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (7,6))

plt.imshow(data_array, 
           cmap='jet',
           origin='lower')
plt.colorbar(cax = fig.add_axes([1, 0.25, 0.05, 0.5]))
plt.title("MLR Predicted \n SOC (mg/g)")

plt.show()