Lesson_06_wotking_with_point_data.png

As I discussed earlier that zero-dimensional points are essential for geographical features like wells, soil-sampling locations that can be best expressed by a single reference point. Often point vector data consist of several points into a multi-point structure, with a single attribute record. For example, all the soil sampling points could be considered as a single geometry. The main drawback of a point feature, however, is that they cannot be used to make measurements (as you can with a polygon)

In this section, we will use 650 soil samples from Colorado, Kansas, New Mexico, and Wyoming. These samples were collected by the United States Geological Survey (USGS) as a part of the USGS Geochemical Landscapes Project [Smith et al., 2011]. We will learn how to join spatial and non-spatial data and create a dataframe for spatial interpolation.

This exercise consists of following major steps:

Part-1: Create a Spatial Point Data Frame

Part-2: Data Visualization

Part 3: Exploratory Data Analysis

Part 4: Statistical Analysis

Data

We will use following spatial data to create the data frame. The data could be download from here.

  • GP_ID.: Containts FIPS code, State and County names

  • Soil sampling locations (GP_GPS.CSV): This file contains coordinates of 473 soil sampling sites in Colorado, Kansas, New Mexico, and Wyoming. The soil samples were collected by United States Geological Survey (USGS) throughout the A horizon with sampling densities of 1 sample per ~1600 km2 (Smith et al., 2011).

  • Soil organic carbon (SOC) (GP_SOC.csv): SOC concentration of these 473 sites. The concentration of SOC was predicted by means of mid-infrared (MIR) spectroscopy and partial least squares regression (PLSR) analysis described previously (Janik et al., 2007; Ahmed et al., 2017).

  • Environmental covariables (gp_point_data.csv)

  • Prediction grid data

Setting consistent plotting style throughout notebook

In [1]:
import seaborn as sns

sns.set_style("white")
sns.set(font_scale=1.5)

Part 1: Create a Spatial Point Data Frame

In [2]:
## Create Woking directory 
In [3]:
import os

path= "E:\GitHub\geospatial-python-github.io\geospatial-python.io\Lesson_07_working_with_point_data"
os.chdir(path)
print("Current Working Directory " , os.getcwd())
Current Working Directory  E:\GitHub\geospatial-python-github.io\geospatial-python.io\Lesson_07_working_with_point_data

Load CSV Files as Pandas Data Frame

1. Site ID file

In [4]:
import pandas as pd

# Read ID file
f_id = 'Data_07/GP_ID.csv'
id_df = pd.read_csv(f_id)
id_df.head()
Out[4]:
SiteID FIPS STATE_ID STATE COUNTY
0 1 8001 8 Colorado Adams County
1 2 8003 8 Colorado Alamosa County
2 3 8005 8 Colorado Arapahoe County
3 4 8005 8 Colorado Arapahoe County
4 5 8005 8 Colorado Arapahoe County

2. Geo-coordnates Soil sampling Sites

In [5]:
import pandas as pd

# Read GPS of soil sampling locations
f_gps = 'Data_07/GP_GPS.csv'
gps = pd.read_csv(f_gps)
gps.head()
Out[5]:
SiteID long lat
0 1 -104.444250 39.95648
1 2 -105.927230 37.60194
2 3 -104.840678 39.64298
3 4 -104.504870 39.61835
4 5 -104.421550 39.61246

3. Soil Carbon Data

In [6]:
import pandas as pd

#  Soil OC data
f_soc = 'Data_07/GP_SOC.csv'
soc = pd.read_csv(f_soc)
soc.head()
Out[6]:
SiteID SOC
0 1 2.304
1 2 3.154
2 3 1.309
3 4 4.717
4 5 2.230

4. Environmental Covariables

In [7]:
import pandas as pd

f_co = 'Data_07/gp_point_data.csv'
co_var = pd.read_csv(f_co)
# Convert NLCD to integer
co_var['NLCD']=co_var['NLCD'].astype(int)
co_var.head()
Out[7]:
SiteID DEM Slope MAP MAT NDVI NLCD
0 1 1551.130005 1.30141 384.401001 9.79839 0.331707 7
1 2 2308.689941 0.85811 194.281998 5.36274 0.430466 7
2 3 1689.930054 2.96257 469.255005 10.05500 0.471263 6
3 4 1822.170044 2.71405 470.238007 9.39621 0.396961 6
4 5 1758.680054 1.79807 439.721985 9.52607 0.348821 6

5. NLCD ID File

In [8]:
import pandas as pd

f_nlcd = 'Data_07/NLCD_ID.csv'
nlcd_id = pd.read_csv(f_nlcd)
nlcd_id.head()
Out[8]:
NLCD_DES NLCD
0 Forest 4
1 Herbaceous 6
2 Planted/Cultivated 7
3 Shrubland 5

Convert Pandas Data Frame to Geo Data Frame

A GeoDataFrame needs a shapely object. We use geopandas points_from_xy() to transform Longitude and Latitude into a list of shapely.Point objects and set it as a geometry while creating the GeoDataFrame. (note that points_from_xy() is an enhanced wrapper for [Point(x, y) for x, y in zip(df.Longitude, df.Latitude)])

In [9]:
import geopandas as gpd

gps_point = gpd.GeoDataFrame(
    gps, geometry=gpd.points_from_xy(gps.long, gps.lat))
In [10]:
gps_point.head()
Out[10]:
SiteID long lat geometry
0 1 -104.444250 39.95648 POINT (-104.44425 39.95648)
1 2 -105.927230 37.60194 POINT (-105.92723 37.60194)
2 3 -104.840678 39.64298 POINT (-104.84068 39.64298)
3 4 -104.504870 39.61835 POINT (-104.50487 39.61835)
4 5 -104.421550 39.61246 POINT (-104.42155 39.61246)
In [11]:
# Plot data
gps_point.plot()
Out[11]:
<AxesSubplot:>

Define Map Projection

We will transform point shape file from WGS 1984 to Albers Equal Area Conic NAD1983 (EPSG:5070). We could reproject the data from one CRS to another CRS using the geopandas method (i.e. function belonging to geopandas objects): to_crs(). Function .to_crs() will only work if your original spatial object has a CRS assigned to it AND if that CRS is the correct CRS for that data!

In [12]:
print(gps_point.crs)
None

We have define the projection before change CRS to projected coordinatesystem. We assign the WGS84 latitude-longitude CRS (EPSG:4326) to the crs attribute:

In [13]:
# Define CRS
gps_point.crs = ({'init' :'epsg:4326'})
E:\PyGeo\lib\site-packages\pyproj\crs\crs.py:53: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  return _prepare_from_string(" ".join(pjargs))

Now we will repoject gps_point data using the CRS of GP_STATE shape file. First we extract CRS from State shape file

In [14]:
import geopandas as gpd

# US State shape files
fn="Data_07/GP_STATE.shp"
GP_STATE= gpd.read_file(fn)
# CRS of state polygon
GP_crs=GP_STATE.crs
GP_crs
Out[14]:
<Projected CRS: PROJCS["Albers_Conical_Equal_Area",GEOGCS["NAD83", ...>
Name: Albers_Conical_Equal_Area
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: unnamed
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
In [15]:
# Repojection to  epsg:5070 projection
point_proj=gps_point.to_crs(GP_crs)
In [16]:
import matplotlib.pyplot as plt

# Make subplots that are next to each other
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 8))

# Plot the data in WGS84 CRS
gps_point.plot(ax=ax1, facecolor='gray');

# Add title
ax1.set_title("WGS84");

# Plot the one 
point_proj.plot(ax=ax2, facecolor='blue');

# Add title
ax2.set_title("Albers Equal Area Conic");

# Remove empty white space around the plot
plt.tight_layout()

Merging Tabular and Spatial Data

Attribute joins are accomplished using the merge() function. In general, it is recommended to use the merge method called from the spatial dataset. The stand-alone merge() function will work if the GeoDataFrame is in the left argument; if a DataFrame is in the left argument and a GeoDataFrame is in the right position, the result will no longer be a GeoDataFrame.

First we will create PD frame with all non-spatial data. We will use pd.merge() function to create a dataframe.

In [17]:
from functools import reduce

df1 = [id_df, soc,co_var]
df2 = reduce(lambda left,right: pd.merge(left,right,on='SiteID'), df1)
df2
Out[17]:
SiteID FIPS STATE_ID STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD
0 1 8001 8 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7
1 2 8003 8 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7
2 3 8005 8 Colorado Arapahoe County 1.309 1689.930054 2.96257 469.255005 10.05500 0.471263 6
3 4 8005 8 Colorado Arapahoe County 4.717 1822.170044 2.71405 470.238007 9.39621 0.396961 6
4 5 8005 8 Colorado Arapahoe County 2.230 1758.680054 1.79807 439.721985 9.52607 0.348821 6
... ... ... ... ... ... ... ... ... ... ... ... ...
565 566 56045 56 Wyoming Weston County 4.701 1296.369995 2.50288 315.881012 7.58175 0.345824 6
566 567 56045 56 Wyoming Weston County 4.701 1296.369995 2.50288 315.881012 7.58175 0.345824 6
567 568 56045 56 Wyoming Weston County 4.701 1296.369995 2.50288 315.881012 7.58175 0.345824 6
568 569 56045 56 Wyoming Weston County 3.105 1286.000000 2.30247 331.578003 7.55435 0.352597 6
569 570 56045 56 Wyoming Weston County 5.600 1417.859985 4.09692 434.790009 6.45444 0.489789 5

570 rows × 12 columns

Now we add NLCD Description column to the data-frame

In [18]:
from functools import reduce

d3 = [df2, nlcd_id]
df = reduce(lambda left,right: pd.merge(left,right,on='NLCD'), d3)
df
Out[18]:
SiteID FIPS STATE_ID STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD NLCD_DES
0 1 8001 8 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7 Planted/Cultivated
1 2 8003 8 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7 Planted/Cultivated
2 12 8009 8 Colorado Baca County 6.129 1221.199951 1.06476 443.065002 12.13140 0.366326 7 Planted/Cultivated
3 13 8009 8 Colorado Baca County 6.129 1221.199951 1.06476 443.065002 12.13140 0.366326 7 Planted/Cultivated
4 14 8009 8 Colorado Baca County 6.129 1221.199951 1.06476 443.065002 12.13140 0.366326 7 Planted/Cultivated
... ... ... ... ... ... ... ... ... ... ... ... ... ...
565 558 56043 56 Wyoming Washakie County 2.899 1380.719971 2.71461 246.516006 7.45715 0.171190 5 Shrubland
566 559 56043 56 Wyoming Washakie County 2.899 1380.719971 2.71461 246.516006 7.45715 0.171190 5 Shrubland
567 562 56043 56 Wyoming Washakie County 1.764 1421.709961 4.37551 290.014008 7.40410 0.212164 5 Shrubland
568 563 56043 56 Wyoming Washakie County 11.615 1631.560059 6.85482 378.261993 6.82645 0.298178 5 Shrubland
569 570 56045 56 Wyoming Weston County 5.600 1417.859985 4.09692 434.790009 6.45444 0.489789 5 Shrubland

570 rows × 13 columns

Now, we will merge GeoDataFrame (point_proj) with a pandas DataFrame (df) by SiteID of each each sampling location. Be

In [19]:
point_spdf = point_proj.merge(df, on='SiteID')
point_spdf.head()
Out[19]:
SiteID long lat geometry FIPS STATE_ID STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD NLCD_DES
0 1 -104.444250 39.95648 POINT (-714067.484 1913846.289) 8001 8 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7 Planted/Cultivated
1 2 -105.927230 37.60194 POINT (-866566.008 1663513.090) 8003 8 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7 Planted/Cultivated
2 3 -104.840678 39.64298 POINT (-750759.166 1881917.141) 8005 8 Colorado Arapahoe County 1.309 1689.930054 2.96257 469.255005 10.05500 0.471263 6 Herbaceous
3 4 -104.504870 39.61835 POINT (-722566.371 1876565.288) 8005 8 Colorado Arapahoe County 4.717 1822.170044 2.71405 470.238007 9.39621 0.396961 6 Herbaceous
4 5 -104.421550 39.61246 POINT (-715564.605 1875277.417) 8005 8 Colorado Arapahoe County 2.230 1758.680054 1.79807 439.721985 9.52607 0.348821 6 Herbaceous

Write as a ESRI Shape File

In [20]:
point_spdf.to_file("Data_07/gp_point_all_data.shp")

Extract Geometry From Spatial Point Data Frame

Now we will extract projected geometry (m) from the spatial point data-frame.

In [21]:
# Get X (m) and Y (m) Values from geometry column
point_spdf['x'] = point_spdf.geometry.x
point_spdf['y'] = point_spdf.geometry.y
point_spdf.head()
Out[21]:
SiteID long lat geometry FIPS STATE_ID STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD NLCD_DES x y
0 1 -104.444250 39.95648 POINT (-714067.484 1913846.289) 8001 8 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7 Planted/Cultivated -714067.484244 1.913846e+06
1 2 -105.927230 37.60194 POINT (-866566.008 1663513.090) 8003 8 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7 Planted/Cultivated -866566.008141 1.663513e+06
2 3 -104.840678 39.64298 POINT (-750759.166 1881917.141) 8005 8 Colorado Arapahoe County 1.309 1689.930054 2.96257 469.255005 10.05500 0.471263 6 Herbaceous -750759.166338 1.881917e+06
3 4 -104.504870 39.61835 POINT (-722566.371 1876565.288) 8005 8 Colorado Arapahoe County 4.717 1822.170044 2.71405 470.238007 9.39621 0.396961 6 Herbaceous -722566.371080 1.876565e+06
4 5 -104.421550 39.61246 POINT (-715564.605 1875277.417) 8005 8 Colorado Arapahoe County 2.230 1758.680054 1.79807 439.721985 9.52607 0.348821 6 Herbaceous -715564.604578 1.875277e+06

Convert Geo Data Frame to Pandas Data Frame

Now we will convert geopandas data frame to pandas data frame without geometry column. We will use this data-frame for futher analysis.

In [22]:
import pandas as pd

point_df = pd.DataFrame(point_spdf.drop(columns=['geometry']))

# Organize variables
point_mf= point_df[['SiteID', 'long', 'lat', 'x', 'y', 'FIPS','STATE', 'COUNTY', 
    'SOC', 'DEM','Slope','MAP', 'MAT','NDVI','NLCD','NLCD_DES']]
point_mf.head()
Out[22]:
SiteID long lat x y FIPS STATE COUNTY SOC DEM Slope MAP MAT NDVI NLCD NLCD_DES
0 1 -104.444250 39.95648 -714067.484244 1.913846e+06 8001 Colorado Adams County 2.304 1551.130005 1.30141 384.401001 9.79839 0.331707 7 Planted/Cultivated
1 2 -105.927230 37.60194 -866566.008141 1.663513e+06 8003 Colorado Alamosa County 3.154 2308.689941 0.85811 194.281998 5.36274 0.430466 7 Planted/Cultivated
2 3 -104.840678 39.64298 -750759.166338 1.881917e+06 8005 Colorado Arapahoe County 1.309 1689.930054 2.96257 469.255005 10.05500 0.471263 6 Herbaceous
3 4 -104.504870 39.61835 -722566.371080 1.876565e+06 8005 Colorado Arapahoe County 4.717 1822.170044 2.71405 470.238007 9.39621 0.396961 6 Herbaceous
4 5 -104.421550 39.61246 -715564.604578 1.875277e+06 8005 Colorado Arapahoe County 2.230 1758.680054 1.79807 439.721985 9.52607 0.348821 6 Herbaceous

Write as a CSV File

In [23]:
point_mf.to_csv('Data_07/gp_all_data.csv', header =True, index=False)

Part 2: Data Visualization

Boxplot

Boxplot with Seaborn

In [24]:
import seaborn as sns

p, ax = plt.subplots(figsize=(10, 6))
ax=sns.boxplot(y="SOC", 
            x="NLCD_DES", 
            data=point_mf,
            palette="Blues_d")

ax.axes.xaxis.label.set_text("NLCD")
ax.axes.yaxis.label.set_text("Soil OC (mg/g)")
#plt.title('Mean ± SE of Soil OC')

Interactive Boxplot with Plotly

In [25]:
import plotly.express as px

fig = px.box(point_mf, 
             x="NLCD_DES", 
             y="SOC",
             width=800, 
             height=500)
fig.update_layout(
    title={
        'text': "Sourface Soil Organic (mg/g)",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
     xaxis_title="NLCD",
     yaxis_title="Soil OC",
     font=dict(
        family="sans-serif, monospace",
        size=18,
        color="#7f7f7f"))

fig.show()